A Methodology to Segment the Text for Index Terms

Authors

  • Muhammad Shoaib
  • Abad Ali Shah
Abstract

Information overload has become a pressing issue with the growth of the World Wide Web. The need for tools that can absorb this volume of information and mitigate the problem is evident, especially for information retrieval (IR) systems. Text is not a simple sequence of words; it carries structure. It is therefore essential to handle complex sentence structures and the grammatical and lexical irrelevancy of different units. The main idea is to segment the text into elementary units, which are simpler and less complex than the text they come from. We use cue phrases and punctuation marks as segmentation signals. We present an algorithm that is efficient and handles more than 500 cue phrases and most punctuation marks. The elementary units it yields can be passed to a Rhetorical Relations Finder to identify the relations among them; these relations can in turn be used by an RST (Rhetorical Structure Theory) Parser to construct an RST tree, which will serve as the basis for an RST-based indexer. In future work, the algorithm can be extended to handle other discourse markers, covering more complex cases where cue phrases and punctuation are not applicable.
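The segmentation idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given here; it only assumes that a unit boundary falls after a punctuation mark or just before a cue phrase. The function name `segment`, the six-entry `CUE_PHRASES` list, and the `BOUNDARY_PUNCTUATION` set are hypothetical stand-ins for the more than 500 cue phrases the paper reports.

```python
import re

# Hypothetical, tiny cue-phrase list for illustration only;
# the paper reports handling more than 500 cue phrases.
CUE_PHRASES = ["because", "although", "however", "but", "so that", "in order to"]

# Assumed set of punctuation marks that close an elementary unit.
BOUNDARY_PUNCTUATION = ".;:!?,"


def segment(text, cue_phrases=CUE_PHRASES):
    """Split text into candidate elementary units.

    A boundary is placed at whitespace that either follows a
    punctuation mark or precedes a cue phrase (case-insensitive).
    """
    # Try longer phrases first so multi-word cues are listed before short ones.
    cues = sorted(cue_phrases, key=len, reverse=True)
    cue_pattern = "|".join(re.escape(c) for c in cues)
    punct_class = "[" + re.escape(BOUNDARY_PUNCTUATION) + "]"
    boundary = re.compile(
        r"(?<=" + punct_class + r")\s+"           # whitespace after punctuation
        r"|\s+(?=(?:" + cue_pattern + r")\b)",    # whitespace before a cue phrase
        re.IGNORECASE,
    )
    # Drop empty pieces and trim surrounding whitespace.
    return [unit.strip() for unit in boundary.split(text) if unit.strip()]


if __name__ == "__main__":
    sample = "The index is small because the text is segmented, but recall stays high."
    for unit in segment(sample):
        print(unit)
    # The index is small
    # because the text is segmented,
    # but recall stays high.
```

In a pipeline like the one the abstract outlines, units of this kind would be the input to the relation-finding step. A purely pattern-based split such as this one cannot tell a cue phrase used as a discourse marker from the same words used otherwise, which is one reason the sketch should be read as an approximation.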

Similar resources

A Two-Stage Split-and-Select Model for Automatic Indexing of Persian Texts

Purpose: Each language has its own problems, so appropriate models for automatic indexing must be considered for each language. These models should address the exhaustivity and specificity of indexing. This paper aims to introduce and evaluate a model suited to Persian automatic indexing. This model suggests breaking the text into the particles of candidate terms and to c...

Removing Geometric Distortion from Text Documents Using the Geometric Information of Text Lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from such a document image using OCR is subject to errors. In this paper we propose a novel approach to significantly reduce geometric distortion in document images. In this method, we first extract document lines from do...

Reading and Assessing the City / Neighborhood Fabric as a Text. Case Study: Sar-Tapulah Historical Neighbourhood in Sanandaj

From a linguistic point of view, the city can be seen as a text, consisting of different components and structures being related to each other beyond a sentence. Looking at the city from this point of view, what establishes a syntactic relationship and cohesion and coherence of the components of the city as a common language is called the syntax of the city. Linguistic study of the text of the ...

Investigating the Role of Homograph Context Types in Determining Similarity Between Documents

Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...

Semiotic Analysis of Written Signs in the Road Sign Systems of Tehran City

Introduction: As a component of the urban landscape, road sign systems are among the most critical elements of urban environments. Generally speaking, written signs dominate the design of these systems. These signs can also foster aesthetic and visual pleasure compellingly and innovatively. Furthermore, they perpetuate a specific image in the minds of their observers. This research seeks to...

Publication date: 2005